Precision annotation of digital samples in NCBI’s gene expression omnibus

Authors

  • Dexter Hadley
  • James Pan
  • Osama El-Sayed
  • Jihad Aljabban
  • Imad Aljabban
  • Tej D Azad
  • Mohamad O Hadied
  • Shuaib Raza
  • Benjamin Abhishek Rayikanti
  • Bin Chen
  • Hyojung Paik
  • Dvir Aran
  • Jordan Spatz
  • Daniel Himmelstein
  • Maryam Panahiazar
  • Sanchita Bhattacharya
  • Marina Sirota
  • Mark A Musen
  • Atul J Butte
Abstract

The Gene Expression Omnibus (GEO) contains more than two million digital samples from functional genomics experiments amassed over almost two decades. However, individual sample metadata remains poorly described by unstructured free-text attributes, preventing its large-scale reanalysis. We introduce the Search Tag Analyze Resource for GEO as a web application (http://STARGEO.org) to curate better annotations of sample phenotypes uniformly across different studies, and to use these sample annotations to define robust genomic signatures of disease pathology by meta-analysis. In this paper, we target a small group of biomedical graduate students to show rapid crowd-curation of precise sample annotations across all phenotypes, and we demonstrate the biological validity of these crowd-curated annotations for breast cancer. STARGEO.org makes GEO data findable, accessible, interoperable and reusable (i.e., FAIR) to ultimately facilitate knowledge discovery. Our work demonstrates the utility of crowd-curation and interpretation of open 'big data' under FAIR principles as a first step towards realizing an ideal paradigm of precision medicine.

Similar resources

Using the NCBO Web Services for Concept Recognition and Ontology Annotation of Expression Datasets

To provide enhanced access to expression datasets housed in the NCBI’s Gene Expression Omnibus database and to enable new opportunities for data mining we are using the NCBO’s Open Biomedical Annotator service to identify concepts and ontology terms in GEO records. Based on this first pass annotation we are curating these datasets using a variety of ontologies covering concepts of relevance to ...

Generation of transcript counts from pasilla dataset with kallisto

The pasilla dataset was produced by Brooks et al. [1]. The aim of their study was to identify exons that are regulated by pasilla protein, the Drosophila melanogaster ortholog of mammalian NOVA1 and NOVA2 (well studied splicing factors). In their RNA-seq experiment, the libraries were prepared from 7 biologically independent samples: 4 control samples and 3 samples in which pasilla was knocked-...

Study of Gene Expression Signatures for the Diagnosis of Pediatric Acute Lymphoblastic Leukemia (ALL) Through Gene Expression Array Analyses

Background: Acute lymphoblastic leukemia (ALL) as the most common malignancy in children is associated with high mortality and significant relapse. Currently, the non-invasive diagnosis of pediatric ALL is a main challenge in the early detection of patients. In the present study, a systems biology approach was used through network-based analysis to identify the key candidate genes related to AL...

Identification of Prognostic Genes in Her2-enriched Breast Cancer by Gene Co-Expression Network Analysis

Introduction: HER2-enriched subtype of breast cancer has a worse prognosis than luminal subtypes. Recently, the discovery of targeted therapies in other groups of breast cancer has increased patient survival. The aim of this study was to identify genes that affect the overall survival of this group of patients based on a systems biology approach. Methods: Gene expression data and clinical infor...

Comparative analysis of hepatocellular carcinoma and cirrhosis gene expression profiles

Gene expression data of hepatocellular carcinoma (HCC) was compared with that of cirrhosis (C) to identify critical genes in HCC. A total of five gene expression data sets were downloaded from Gene Expression Omnibus. HCC and healthy samples were combined as dataset HCC, whereas cirrhosis samples were included in dataset C. A network was constructed for dataset HCC with the package R for perfor...

Journal title:

Volume 4, Issue

Pages -

Publication date: 2017